class: center middle

# Graphical Descriptive Statistics

Michael Luu, MPH | Marie Lauzon, MS

Biostatistics & Bioinformatics Research Center | Cedars Sinai Medical Center

#### September 14, 2021

???

* Hi everyone. My name is Michael Luu, and I'm a Research Biostatistician here at the Biostatistics Core, part of the Cedars Sinai Cancer Center. Today I will be talking about graphical descriptive statistics. The objective of my talk is to provide a motivating example of the need to visualize our data, and to introduce some of the basic tools and simple figure types you can use to accomplish this goal.

* So to start off my talk, I want to provide a motivating example -

---
class: inverse center middle

# Why do we need to visualize our data?

---

.pull-left[

|dataset |x     |y     |
|:-------|:-----|:-----|
|A       |55.38 |97.18 |
|A       |51.54 |96.03 |
|A       |46.15 |94.49 |
|A       |42.82 |91.41 |
|A       |40.77 |88.33 |

<br>

|dataset |x     |y     |
|:-------|:-----|:-----|
|B       |58.21 |91.88 |
|B       |58.20 |92.21 |
|B       |58.72 |90.31 |
|B       |57.28 |89.91 |
|B       |58.08 |92.01 |

]

.pull-right[

|dataset |x     |y     |
|:-------|:-----|:-----|
|C       |38.34 |92.47 |
|C       |35.75 |94.12 |
|C       |32.77 |88.52 |
|C       |33.73 |88.62 |
|C       |37.24 |83.72 |

<br>

|dataset |x     |y     |
|:-------|:-----|:-----|
|D       |55.99 |79.28 |
|D       |50.03 |79.01 |
|D       |51.29 |82.44 |
|D       |51.17 |79.17 |
|D       |44.38 |78.16 |

]

???

* What I'm showing here are 4 unique datasets - dataset A, B, C, and D.

* Of note, these are truncated versions of the full datasets; I'm only showing the first 5 rows due to space.

* Each dataset has 3 variables: the dataset variable, X, and Y

* X and Y are both continuous variables, as seen on this slide

---
class: inverse center middle

# Let's begin by taking descriptive measures

???
* So, let's begin by taking some descriptive summary measures of the four datasets, as discussed by my colleague Marie

---

# Quantitative Summaries

.center[

|dataset |n     |mean_x |sd_x |mean_y |sd_y |
|:-------|:-----|:------|:----|:------|:----|
|A       |142.0 |54.3   |16.8 |47.8   |26.9 |
|B       |142.0 |54.3   |16.8 |47.8   |26.9 |
|C       |142.0 |54.3   |16.8 |47.8   |26.9 |
|D       |142.0 |54.3   |16.8 |47.8   |26.9 |

]

???

* What you're seeing here is a small table with some basic summary measures: counts (n), mean, and standard deviation. Each row holds the summary measures for one of the four datasets.

* For example, the first row shows the descriptive measures for dataset A, followed by B, C, and D.

* For dataset A, we have 142 rows, with a mean of 54.3 for X and 47.8 for Y, and an SD of 16.8 for X and 26.9 for Y

* You may have already noticed the obvious...

* The summary measures are identical for all four datasets!

--

* It appears the counts (n), mean (x), mean (y), sd (x), and sd (y) are identical for ALL four datasets!

---
class: inverse center middle

# Can we conclude the datasets are similar or identical?

???

* So what can we conclude about the four datasets?

---
class: center middle

# Not quite yet!

---
class: inverse center middle

# Let's visualize the relationship of x and y

---

# Dataset A

.center[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-8-1.png" width="100%" />
]

???

* What you are seeing is a scatter plot of X and Y for dataset A. In short, a scatter plot allows us to visualize the relationship between two continuous variables. We have X on the x-axis and Y on the y-axis. Each point represents a single observation with its corresponding X and Y values. When visualized, the points form a picture of a dino.

---

# Dataset B

.center[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-9-1.png" width="100%" />
]

???
* Again, when visualizing dataset B, we have a plot of a star

---

# Dataset C

.center[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-10-1.png" width="100%" />
]

???

* For dataset C, we have an X

---

# Dataset D

.center[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-11-1.png" width="100%" />
]

???

* And for dataset D, we have a circle

---
class: center middle

<img src="data:image/png;base64,#images/blow-mind-mind-blown.gif" width="75%" />

???

* You might be thinking, what is going on!? I just showed you four different datasets with identical or nearly identical summary measures, yet they produce these unique and interesting figures...

---
class: center middle

<img src="data:image/png;base64,#images/DinoSequentialSmaller.gif" width="100%" />

???

* This phenomenon extends beyond the 4 example datasets that I showed: there are 13 datasets, as seen on this slide, that have nearly identical summary measures yet look very different when visualized.

* So what is the takeaway from this rather unique and exaggerated example?

---
class: inverse center middle

# Although simple quantitative summaries are similar ...

---
class: center middle

# They can appear drastically different when visualized!

---

# Datasaurus Dozen

* The original "Datasaurus" or "dino" was created by **Alberto Cairo** in the following [blog post](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html)

* It was later made famous by the paper by **Justin Matejka** and **George Fitzmaurice**, titled ['Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing'](https://www.autodesk.com/research/publications/same-stats-different-graphs), in which they simulated 12 datasets, in addition to the original "Datasaurus", with nearly identical simple statistics

???
* Now a little background on the dataset that I showed - the dataset is called the Datasaurus Dozen

---
class: middle

.center[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-14-1.png" width="75%" />
]

???

* The following is the original 'dino' or Datasaurus, along with the 12 additional datasets that were carefully simulated to have nearly identical summary measures

* The method they used to create these datasets is outside the scope of this talk; the important takeaway is the need to visualize your data.

* As seen on this slide, simple summary measures are useful, but they CAN be deceiving when you're not careful. Plots and figures can describe aspects of your data that simple summary measures do not capture.

---

# Anscombe Quartet

* The Datasaurus Dozen is a modern take on the classical **"Anscombe's Quartet"**

  - Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician. 27 (1): 17–21. doi:10.1080/00031305.1973.10478966. JSTOR 2682899

  - Comprised of four datasets that have nearly identical simple summary measures, yet have very different distributions and appear vastly different when plotted

---
class:

# Anscombe Quartet

.center[

|dataset |n     |mean_x |sd_x |mean_y |sd_y |
|:-------|:-----|:------|:----|:------|:----|
|I       |11.00 |9.00   |3.32 |7.50   |2.03 |
|II      |11.00 |9.00   |3.32 |7.50   |2.03 |
|III     |11.00 |9.00   |3.32 |7.50   |2.03 |
|IV      |11.00 |9.00   |3.32 |7.50   |2.03 |

]

.center[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-16-1.png" width="100%" />
]

???

* Describe what happens if we fit a regression to each dataset and how the fit can be affected

---
class: inverse middle center

# Types of Graphical Visualizations

???
* Now I would like to go into some detail on different ways to graphically summarize your data

---

# Dot plot

.pull-left[

* Useful for small to moderately sized data

* Allows us to visualize the spread and distribution of one continuous or discrete variable
  * e.g. length of stay

* The X axis is the variable of interest and each dot represents a single observation

* Easy to identify the mode

* Highlights clusters, gaps, and outliers

* Intuitive and easy to understand

]

.pull-right[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-17-1.png" width="100%" />
]

???

* One of the most basic visualizations that allows you to get a sense of the data is the dot plot.

<!-- * If we take another sample - the figure may look different -->

---

# Histogram

.pull-left[

* Useful for data of any size (small or large)

* Allows us to visualize the spread and distribution of continuous variables

* Each bar represents a 'bin', or a defined interval of values

* Although not as common, the widths of the bins do NOT have to be equal!

* The y axis, or the height of the bar, represents the count of values that fall into each bin

* The y axis is also commonly normalized to 'relative' frequencies to show the proportion of cases, or the density, that falls into each bin.

]

.pull-right[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-18-1.png" width="100%" />
]

???

---

# Distribution

> "A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically."
> (Page 6, Statistics in Plain English, Third Edition, 2010)

<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-19-1.png" width="50%" style="display: block; margin: auto;" />

???
* In statistics we also have something called a probability distribution, which describes the probability of occurrence of random values, assuming they follow a specific distribution.

* The values shown in the histogram on this slide are drawn from the most well understood probability distribution, the 'normal distribution', which resembles a bell-shaped curve

* Histograms allow us to better understand the distribution, or shape, of our data.

* One of the notable properties of the normal distribution is that the peak of the bell-shaped curve marks the location of the mean, median, and mode.

---

# Normal Distribution

<img src="data:image/png;base64,#images/normal_distribution_figure.png" width="65%" style="display: block; margin: auto;" />

???

* As mentioned previously, if we understand the distribution or shape of our data and assume it follows a specific distribution like the normal distribution, then we understand the frequency or probability of occurrence of such random values.

* For example, about 68.3% of the values fall within plus or minus 1 standard deviation of the mean, 95.4% fall within 2 standard deviations of the mean, and 99.7% fall within 3 standard deviations of the mean.

<!-- * Add the numbers e.g. 68.27 on the plot -->

---

# Univariate Continuous Distributions

<img src="data:image/png;base64,#images/univariate_continuous_distributrions.PNG" width="50%" style="display: block; margin: auto;" />

???

* The normal distribution is just one of many probability distributions, as visualized in the following figures

* Shown on this slide is just a small snapshot of the distributions that have been studied

* Some of the notable distributions on this slide include the normal distribution, which we just discussed, as well as the Student's t distribution.
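
---

# Normal Distribution: checking the coverage

The coverage percentages quoted for the normal distribution can be reproduced numerically. The sketch below is illustrative only and was not part of the original talk: it assumes a normally distributed variable, the helper name `normal_coverage` is hypothetical, and it relies on the identity P(|Z| < k) = erf(k / sqrt(2)).

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normal value lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


# Print the coverage within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within {k} SD: {normal_coverage(k):.2%}")
```

Only the number of standard deviations enters the calculation, so the same coverage applies to any normal distribution regardless of its mean and SD.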

---

# Univariate Discrete Distributions

<img src="data:image/png;base64,#images/univariate_discrete_distributions.PNG" width="50%" style="display: block; margin: auto;" />

???

* Again, among the univariate discrete distributions, some of the notable ones include the Binomial and the Poisson distributions.

---

# Scatter plot

.pull-left[

* Used to visualize the relationship between two continuous variables

* Useful for detecting patterns that are obscured by quantitative summaries, like what we observed in Anscombe's quartet and the Datasaurus dozen.

]

.pull-right[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-23-1.png" width="100%" />
]

???

* Next I wanted to reintroduce the scatter plot -

---

# Bar plot

.pull-left[

* Useful for visualizing **categorical** data

* Commonly used to present counts and proportions of each level

* Allows us to quickly observe the difference in magnitude of each level based on the height of each bar

]

.pull-right[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-24-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: inverse middle center

# However...

---
class: middle center

# Bar plots are commonly misused!

---

# How NOT to Bar Plot

<img src="data:image/png;base64,#images/how_not_to_barplot.PNG" width="50%" style="display: block; margin: auto;" />

* Krzywinski, M., & Altman, N. (2014). Visualizing samples with box plots. Nature methods, 11(2), 119-120.

???

* The following three figures are examples of how NOT to use a bar plot

* Showing the sample mean with the SD or standard error in a bar plot is NOT recommended!

* The choice of baseline can drastically alter the apparent height of the bars, as seen in the first and second figures. The first figure on the left uses 0 as the reference, whereas the middle figure uses 0.5 as the reference.
* Creating breaks in the Y axis, as seen in the third figure, will also distort our data.

---

# How NOT to Bar Plot

.pull-left[

* Although frequently found in the literature, this is NOT to be used to describe the mean and dispersion of continuous data

* Only shows one arm of the error bar, making overlap comparisons difficult

* Promotes the misconception that the mean is related to the height of the bar rather than the position of its top

* Obscures the distribution and spread of the data

]

.pull-right[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-26-1.png" width="100%" />
]

---

# Box plot

.pull-left[

* Useful for describing continuous variables following a unimodal distribution - i.e. a single peak

* The box is representative of common quantitative measures

  - Top of box is the 75th percentile (Q3)

  - Middle dash inside box is the 50th percentile (median)

  - Bottom of box is the 25th percentile (Q1)

  - Length of the box is the interquartile range (IQR)

* The 'whiskers' extend to the most extreme values inside artificial 'fences' that help identify potential outliers in the data

  - The fences are defined as Q1 - 1.5\*IQR and Q3 + 1.5\*IQR

]

.pull-right[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-27-1.png" width="80%" style="display: block; margin: auto;" />

<img src="data:image/png;base64,#images/boxplot_explained.png" width="80%" style="display: block; margin: auto;" />
]

---
class: inverse center middle

# What are some of the problems with a box plot?

---
class: center middle

# They are based on quantitative summaries!

---

# Box plot

<img src="data:image/png;base64,#images/BoxViolinSmaller.gif" width="100%" />

???

* As we learned at the beginning of the talk, quantitative summaries can be deceiving

* The left figure shows the 'Raw Data', the middle figure shows box plots, and the right figure shows something called 'violin plots'

* These plots are called 'violin plots' because, as you guessed, they resemble violins.
* As you can see in the left figure, as the raw data gets distorted, the box plots in the middle stay identical, since a box plot is built from quantitative summaries. Datasets can have identical quantitative summaries and still look drastically different, and the box plot is not able to capture that.

* Violin plots are able to capture the changes in the data that a simple box plot cannot.

---

# Violin plot

.pull-left[

* Violin plots are box plots with an overlay of the estimated density (a smoothed histogram) of the data

* More informative than a simple box plot

* Visualizes the full distribution of the data

* Especially useful for bimodal or multimodal distributions
  * e.g. distribution of data with multiple peaks

]

.pull-right[
<img src="data:image/png;base64,#descriptive_statistics_files/figure-html/unnamed-chunk-30-1.png" width="100%" />
]

---

# How are violin plots made?

<img src="data:image/png;base64,#images/how_to_make_violin_plots.PNG" width="100%" style="display: block; margin: auto;" />

* Hintze, J. L., & Nelson, R. D. (1998). Violin plots: a box plot-density trace synergism. The American Statistician, 52(2), 181-184.

???

* Violin plots are an amalgamation of a box plot and a density trace, which is essentially a smoothed histogram

---

<img src="data:image/png;base64,#images/jnci_publication_image.PNG" width="100%" style="display: block; margin: auto;" />

???

* I wanted to briefly touch upon the usage of violin plots in the literature and highlight some of our ongoing research, led by the director of the Biostatistics Core, Dr. Andre Rogatko

* We are using data from the NSABP R-04 trial - Neoadjuvant chemotherapy in rectal cancer

* In this trial, rectal cancer is treated with presurgical radiation to downstage the tumor, reduce the risk of local recurrence, and improve survival

* R04 first tested infusional 5FU (standard) vs.
capecitabine (an oral drug), and then used a 2x2 factorial design, with or without oxaliplatin

* The results of the analysis of the NSABP R-04 trial using the toxicity index are detailed in the following article

* In short, we reanalyzed the clinical trial data using the toxicity index developed by Rogatko et al. and compared it with the current standard of assessing toxicity using the maximum reported grade.

* I also wanted to note that several of the coauthors on this list are also lecturers in this series

* Rogatko, A., Babb, J. S., Wang, H., Slifker, M. J., & Hudes, G. R. (2004). Patient characteristics compete with dose as predictors of acute treatment toxicity in early phase clinical trials. Clinical Cancer Research, 10(14), 4645-4651.

---
class:

<br>
<br>
<br>
<br>

.pull-left[
<img src="data:image/png;base64,#images/violin_plot_maxae_grade.PNG" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="data:image/png;base64,#images/violin_plot_ti.PNG" width="100%" style="display: block; margin: auto;" />
]

???

* I wanted to highlight the violin plots that were used here to describe the distribution of the max grade and the TI

* Violin plots were able to highlight the multimodal distribution of both the max grade and the TI, which would not have been visible with just a simple box plot.

* The study found that the toxicity index was more powerful than the max AE grade, requiring smaller sample sizes to detect the same differences.

---

# Summary

.pull-left[

* One continuous variable
  - Dot plot
  - Histogram
  - Box plot
  - Violin plot

* One categorical variable
  - Bar plot

* Two continuous variables
  - Scatter plot

* One continuous variable by one categorical variable
  - Dot plot
  - Box plot
  - Violin plot

]

.pull-right[
<img src="data:image/png;base64,#images/scientific_paper_graph_quality.png" width="100%" />
]

???
* To end, I wanted to leave everyone with a general guide on how to visualize your data; this is by no means a comprehensive list

---
class: center middle

# Descriptive summaries are useful, however ...

---
class: inverse center middle

# Don't forget to visualize your data!

---
class: center middle

# Questions
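
---

# Appendix: Anscombe's Quartet in code

The summary table shown for Anscombe's Quartet can be recomputed from the published values (Anscombe, 1973). This is a minimal sketch rather than the code used to build the slides: it is written in Python with only the standard library, and the names `quartet` and `summarize` are hypothetical. It computes n, the mean, and the sample standard deviation of x and y for each dataset.

```python
from statistics import mean, stdev

# Published values of Anscombe's quartet; datasets I-III share the same x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the count, means, and sample SDs, rounded to 2 decimals."""
    return {
        "n": len(x),
        "mean_x": round(mean(x), 2),
        "sd_x": round(stdev(x), 2),
        "mean_y": round(mean(y), 2),
        "sd_y": round(stdev(y), 2),
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, row in summaries.items():
    print(name, row)
```

Every row prints the same rounded summaries, which is exactly why a scatter plot of each dataset is needed to tell them apart.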